{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# NLP with TextBlob"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "TextBlob is a python library that provides a simple API for common NLP tasks and builds on the Natural Language Toolkit (nltk) and the Pattern web mining libraries. TextBlob facilitates part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and others."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports & Settings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:11:34.788997Z",
     "start_time": "2020-06-20T17:11:34.786749Z"
    }
   },
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:11:35.389589Z",
     "start_time": "2020-06-20T17:11:34.790664Z"
    }
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "from pathlib import Path\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Visualization\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "# spacy, textblob and nltk for language processing\n",
    "from textblob import TextBlob, Word\n",
    "import nltk\n",
    "from nltk.stem.snowball import SnowballStemmer\n",
    "\n",
    "# sklearn for feature extraction & modeling\n",
    "from sklearn.feature_extraction.text import CountVectorizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "# download NLTK resources\n",
    "nltk.download('punkt')"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:11:35.394824Z",
     "start_time": "2020-06-20T17:11:35.390720Z"
    }
   },
   "outputs": [],
   "source": [
    "sns.set_style('white')\n",
    "np.random.seed(42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load BBC Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To illustrate the use of TextBlob, we sample a BBC sports article with the headline ‘Robinson ready for difficult task’. Similar to spaCy and other libraries, the first step is to pass the document through a pipeline represented by the TextBlob object to assign annotations required for various tasks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:11:35.469843Z",
     "start_time": "2020-06-20T17:11:35.396079Z"
    }
   },
   "outputs": [],
   "source": [
    "path = Path('..', 'data', 'bbc')\n",
    "files = sorted(list(path.glob('**/*.txt')))\n",
    "doc_list = []\n",
    "for i, file in enumerate(files):\n",
    "    topic = file.parts[-2]\n",
    "    article = file.read_text(encoding='latin1').split('\\n')\n",
    "    heading = article[0].strip()\n",
    "    body = ' '.join([l.strip() for l in article[1:]]).strip()\n",
    "    doc_list.append([topic, heading, body])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:11:35.477229Z",
     "start_time": "2020-06-20T17:11:35.470871Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 2225 entries, 0 to 2224\n",
      "Data columns (total 3 columns):\n",
      " #   Column   Non-Null Count  Dtype \n",
      "---  ------   --------------  ----- \n",
      " 0   topic    2225 non-null   object\n",
      " 1   heading  2225 non-null   object\n",
      " 2   body     2225 non-null   object\n",
      "dtypes: object(3)\n",
      "memory usage: 52.3+ KB\n"
     ]
    }
   ],
   "source": [
    "docs = pd.DataFrame(doc_list, columns=['topic', 'heading', 'body'])\n",
    "docs.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-11-21T15:03:08.577908Z",
     "start_time": "2018-11-21T15:03:08.572433Z"
    }
   },
   "source": [
    "## Introduction to TextBlob\n",
    "\n",
    "You should already have downloaded TextBlob, a Python library used to explore common NLP tasks."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Select random article"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:11:35.484187Z",
     "start_time": "2020-06-20T17:11:35.478329Z"
    }
   },
   "outputs": [],
   "source": [
    "article = docs.sample(1).squeeze()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:11:35.492788Z",
     "start_time": "2020-06-20T17:11:35.485942Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Topic:\tBusiness\n",
      "\n",
      "UK house prices dip in November\n",
      "\n",
      "UK house prices dipped slightly in November, the Office of the Deputy Prime Minister (ODPM) has said.  The average house price fell marginally to Â£180,226, from Â£180,444 in October. Recent evidence has suggested that the UK housing market is slowing after interest rate increases, and economists forecast a drop in prices during 2005. But while the monthly figures may hint at a cooling of the market, annual house price inflation is still strong, up 13.8% in the year to November. Economists, however, forecast that ODPM figures are likely to show a weakening in annual house price growth in coming months. \"Overall, the housing market activity is slowing down and that is backed up by the mortgage lending and the mortgage approvals data,\" said Mark Miller, at HBOS Treasury Services. \"The ODPM data is a fairly lagging indicator.\"  The figures come after the Bank of England said the number of mortgages approved in the UK has fallen to the lowest level for nearly a decade. The Halifax, meanwhile, said last week that house prices increased by 1.1% in December - the first monthly rise since September.  The UK's biggest mortgage lender said prices rose 15.1% over the whole of 2004, but by only 2.8% in the second half of the year. It is predicting a 2% fall in overall prices in 2005 as the market stabilises after large gains in recent years. The ODPM attributed the monthly fall of prices in November to a drop in the value of detached houses and flats. It said annual inflation rose between October and November because prices had fallen by 1.1% in the same period in 2003.  The ODPM data showed the average house price was Â£192,713 in England; Â£139,544 in Wales; Â£116,542 in Scotland, and Â£111,314 in Northern Ireland.  All areas saw a rise in annual house price inflation in November except for Northern Ireland and the West Midlands, where the rate was unchanged, the ODPM said. The North East showed the highest rate of inflation at 26.2%, followed by Yorkshire and the Humber on 21.7%, and the North West on 21.1%. The East Midlands, the West Midlands and the South West all had an annual inflation rate of more than 15%. In London, the area with the highest average house price at Â£262,825, annual inflation rose only slightly in November to 7.1% from 7% the previous month.\n"
     ]
    }
   ],
   "source": [
    "print(f'Topic:\\t{article.topic.capitalize()}\\n\\n{article.heading}\\n')\n",
    "print(article.body.strip())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:11:35.500823Z",
     "start_time": "2020-06-20T17:11:35.493728Z"
    }
   },
   "outputs": [],
   "source": [
    "parsed_body = TextBlob(article.body)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tokenization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:11:35.520173Z",
     "start_time": "2020-06-20T17:11:35.502616Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "WordList(['UK', 'house', 'prices', 'dipped', 'slightly', 'in', 'November', 'the', 'Office', 'of', 'the', 'Deputy', 'Prime', 'Minister', 'ODPM', 'has', 'said', 'The', 'average', 'house', 'price', 'fell', 'marginally', 'to', 'Â£180,226', 'from', 'Â£180,444', 'in', 'October', 'Recent', 'evidence', 'has', 'suggested', 'that', 'the', 'UK', 'housing', 'market', 'is', 'slowing', 'after', 'interest', 'rate', 'increases', 'and', 'economists', 'forecast', 'a', 'drop', 'in', 'prices', 'during', '2005', 'But', 'while', 'the', 'monthly', 'figures', 'may', 'hint', 'at', 'a', 'cooling', 'of', 'the', 'market', 'annual', 'house', 'price', 'inflation', 'is', 'still', 'strong', 'up', '13.8', 'in', 'the', 'year', 'to', 'November', 'Economists', 'however', 'forecast', 'that', 'ODPM', 'figures', 'are', 'likely', 'to', 'show', 'a', 'weakening', 'in', 'annual', 'house', 'price', 'growth', 'in', 'coming', 'months', 'Overall', 'the', 'housing', 'market', 'activity', 'is', 'slowing', 'down', 'and', 'that', 'is', 'backed', 'up', 'by', 'the', 'mortgage', 'lending', 'and', 'the', 'mortgage', 'approvals', 'data', 'said', 'Mark', 'Miller', 'at', 'HBOS', 'Treasury', 'Services', 'The', 'ODPM', 'data', 'is', 'a', 'fairly', 'lagging', 'indicator', 'The', 'figures', 'come', 'after', 'the', 'Bank', 'of', 'England', 'said', 'the', 'number', 'of', 'mortgages', 'approved', 'in', 'the', 'UK', 'has', 'fallen', 'to', 'the', 'lowest', 'level', 'for', 'nearly', 'a', 'decade', 'The', 'Halifax', 'meanwhile', 'said', 'last', 'week', 'that', 'house', 'prices', 'increased', 'by', '1.1', 'in', 'December', 'the', 'first', 'monthly', 'rise', 'since', 'September', 'The', 'UK', \"'s\", 'biggest', 'mortgage', 'lender', 'said', 'prices', 'rose', '15.1', 'over', 'the', 'whole', 'of', '2004', 'but', 'by', 'only', '2.8', 'in', 'the', 'second', 'half', 'of', 'the', 'year', 'It', 'is', 'predicting', 'a', '2', 'fall', 'in', 'overall', 'prices', 'in', '2005', 'as', 'the', 'market', 'stabilises', 'after', 'large', 'gains', 'in', 'recent', 'years', 'The', 'ODPM', 'attributed', 'the', 'monthly', 'fall', 'of', 'prices', 'in', 'November', 'to', 'a', 'drop', 'in', 'the', 'value', 'of', 'detached', 'houses', 'and', 'flats', 'It', 'said', 'annual', 'inflation', 'rose', 'between', 'October', 'and', 'November', 'because', 'prices', 'had', 'fallen', 'by', '1.1', 'in', 'the', 'same', 'period', 'in', '2003', 'The', 'ODPM', 'data', 'showed', 'the', 'average', 'house', 'price', 'was', 'Â£192,713', 'in', 'England', 'Â£139,544', 'in', 'Wales', 'Â£116,542', 'in', 'Scotland', 'and', 'Â£111,314', 'in', 'Northern', 'Ireland', 'All', 'areas', 'saw', 'a', 'rise', 'in', 'annual', 'house', 'price', 'inflation', 'in', 'November', 'except', 'for', 'Northern', 'Ireland', 'and', 'the', 'West', 'Midlands', 'where', 'the', 'rate', 'was', 'unchanged', 'the', 'ODPM', 'said', 'The', 'North', 'East', 'showed', 'the', 'highest', 'rate', 'of', 'inflation', 'at', '26.2', 'followed', 'by', 'Yorkshire', 'and', 'the', 'Humber', 'on', '21.7', 'and', 'the', 'North', 'West', 'on', '21.1', 'The', 'East', 'Midlands', 'the', 'West', 'Midlands', 'and', 'the', 'South', 'West', 'all', 'had', 'an', 'annual', 'inflation', 'rate', 'of', 'more', 'than', '15', 'In', 'London', 'the', 'area', 'with', 'the', 'highest', 'average', 'house', 'price', 'at', 'Â£262,825', 'annual', 'inflation', 'rose', 'only', 'slightly', 'in', 'November', 'to', '7.1', 'from', '7', 'the', 'previous', 'month'])"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "parsed_body.words"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Sentence boundary detection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:11:35.659254Z",
     "start_time": "2020-06-20T17:11:35.652487Z"
    },
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[Sentence(\"UK house prices dipped slightly in November, the Office of the Deputy Prime Minister (ODPM) has said.\"),\n",
       " Sentence(\"The average house price fell marginally to Â£180,226, from Â£180,444 in October.\"),\n",
       " Sentence(\"Recent evidence has suggested that the UK housing market is slowing after interest rate increases, and economists forecast a drop in prices during 2005.\"),\n",
       " Sentence(\"But while the monthly figures may hint at a cooling of the market, annual house price inflation is still strong, up 13.8% in the year to November.\"),\n",
       " Sentence(\"Economists, however, forecast that ODPM figures are likely to show a weakening in annual house price growth in coming months.\"),\n",
       " Sentence(\"\"Overall, the housing market activity is slowing down and that is backed up by the mortgage lending and the mortgage approvals data,\" said Mark Miller, at HBOS Treasury Services.\"),\n",
       " Sentence(\"\"The ODPM data is a fairly lagging indicator.\"\"),\n",
       " Sentence(\"The figures come after the Bank of England said the number of mortgages approved in the UK has fallen to the lowest level for nearly a decade.\"),\n",
       " Sentence(\"The Halifax, meanwhile, said last week that house prices increased by 1.1% in December - the first monthly rise since September.\"),\n",
       " Sentence(\"The UK's biggest mortgage lender said prices rose 15.1% over the whole of 2004, but by only 2.8% in the second half of the year.\"),\n",
       " Sentence(\"It is predicting a 2% fall in overall prices in 2005 as the market stabilises after large gains in recent years.\"),\n",
       " Sentence(\"The ODPM attributed the monthly fall of prices in November to a drop in the value of detached houses and flats.\"),\n",
       " Sentence(\"It said annual inflation rose between October and November because prices had fallen by 1.1% in the same period in 2003.\"),\n",
       " Sentence(\"The ODPM data showed the average house price was Â£192,713 in England; Â£139,544 in Wales; Â£116,542 in Scotland, and Â£111,314 in Northern Ireland.\"),\n",
       " Sentence(\"All areas saw a rise in annual house price inflation in November except for Northern Ireland and the West Midlands, where the rate was unchanged, the ODPM said.\"),\n",
       " Sentence(\"The North East showed the highest rate of inflation at 26.2%, followed by Yorkshire and the Humber on 21.7%, and the North West on 21.1%.\"),\n",
       " Sentence(\"The East Midlands, the West Midlands and the South West all had an annual inflation rate of more than 15%.\"),\n",
       " Sentence(\"In London, the area with the highest average house price at Â£262,825, annual inflation rose only slightly in November to 7.1% from 7% the previous month.\")]"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "parsed_body.sentences"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Stemming"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To perform stemming, we instantiate the SnowballStemmer from the nltk library, call its .stem() method on each token and display tokens that were modified as a result:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:11:35.678296Z",
     "start_time": "2020-06-20T17:11:35.660778Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('house', 'hous'),\n",
       " ('prices', 'price'),\n",
       " ('dipped', 'dip'),\n",
       " ('slightly', 'slight'),\n",
       " ('November', 'novemb'),\n",
       " ('Office', 'offic'),\n",
       " ('Deputy', 'deputi'),\n",
       " ('Minister', 'minist'),\n",
       " ('average', 'averag'),\n",
       " ('house', 'hous'),\n",
       " ('marginally', 'margin'),\n",
       " ('October', 'octob'),\n",
       " ('evidence', 'evid'),\n",
       " ('suggested', 'suggest'),\n",
       " ('housing', 'hous'),\n",
       " ('slowing', 'slow'),\n",
       " ('increases', 'increas'),\n",
       " ('economists', 'economist'),\n",
       " ('prices', 'price'),\n",
       " ('during', 'dure'),\n",
       " ('monthly', 'month'),\n",
       " ('figures', 'figur'),\n",
       " ('cooling', 'cool'),\n",
       " ('house', 'hous'),\n",
       " ('inflation', 'inflat'),\n",
       " ('November', 'novemb'),\n",
       " ('Economists', 'economist'),\n",
       " ('however', 'howev'),\n",
       " ('figures', 'figur'),\n",
       " ('likely', 'like'),\n",
       " ('weakening', 'weaken'),\n",
       " ('house', 'hous'),\n",
       " ('coming', 'come'),\n",
       " ('months', 'month'),\n",
       " ('Overall', 'overal'),\n",
       " ('housing', 'hous'),\n",
       " ('activity', 'activ'),\n",
       " ('slowing', 'slow'),\n",
       " ('backed', 'back'),\n",
       " ('mortgage', 'mortgag'),\n",
       " ('lending', 'lend'),\n",
       " ('mortgage', 'mortgag'),\n",
       " ('approvals', 'approv'),\n",
       " ('Treasury', 'treasuri'),\n",
       " ('Services', 'servic'),\n",
       " ('fairly', 'fair'),\n",
       " ('lagging', 'lag'),\n",
       " ('indicator', 'indic'),\n",
       " ('figures', 'figur'),\n",
       " ('mortgages', 'mortgag'),\n",
       " ('approved', 'approv'),\n",
       " ('nearly', 'near'),\n",
       " ('decade', 'decad'),\n",
       " ('meanwhile', 'meanwhil'),\n",
       " ('house', 'hous'),\n",
       " ('prices', 'price'),\n",
       " ('increased', 'increas'),\n",
       " ('December', 'decemb'),\n",
       " ('monthly', 'month'),\n",
       " ('since', 'sinc'),\n",
       " ('September', 'septemb'),\n",
       " ('mortgage', 'mortgag'),\n",
       " ('prices', 'price'),\n",
       " ('only', 'onli'),\n",
       " ('predicting', 'predict'),\n",
       " ('overall', 'overal'),\n",
       " ('prices', 'price'),\n",
       " ('stabilises', 'stabilis'),\n",
       " ('large', 'larg'),\n",
       " ('gains', 'gain'),\n",
       " ('years', 'year'),\n",
       " ('attributed', 'attribut'),\n",
       " ('monthly', 'month'),\n",
       " ('prices', 'price'),\n",
       " ('November', 'novemb'),\n",
       " ('value', 'valu'),\n",
       " ('detached', 'detach'),\n",
       " ('houses', 'hous'),\n",
       " ('flats', 'flat'),\n",
       " ('inflation', 'inflat'),\n",
       " ('October', 'octob'),\n",
       " ('November', 'novemb'),\n",
       " ('because', 'becaus'),\n",
       " ('prices', 'price'),\n",
       " ('showed', 'show'),\n",
       " ('average', 'averag'),\n",
       " ('house', 'hous'),\n",
       " ('Wales', 'wale'),\n",
       " ('areas', 'area'),\n",
       " ('house', 'hous'),\n",
       " ('inflation', 'inflat'),\n",
       " ('November', 'novemb'),\n",
       " ('Midlands', 'midland'),\n",
       " ('unchanged', 'unchang'),\n",
       " ('showed', 'show'),\n",
       " ('inflation', 'inflat'),\n",
       " ('followed', 'follow'),\n",
       " ('Yorkshire', 'yorkshir'),\n",
       " ('Midlands', 'midland'),\n",
       " ('Midlands', 'midland'),\n",
       " ('inflation', 'inflat'),\n",
       " ('average', 'averag'),\n",
       " ('house', 'hous'),\n",
       " ('inflation', 'inflat'),\n",
       " ('only', 'onli'),\n",
       " ('slightly', 'slight'),\n",
       " ('November', 'novemb')]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Initialize stemmer.\n",
    "stemmer = SnowballStemmer('english')\n",
    "\n",
    "# Stem each word.\n",
    "[(word, stemmer.stem(word)) for i, word in enumerate(parsed_body.words) \n",
    " if word.lower() != stemmer.stem(parsed_body.words[i])]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Lemmatization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:11:35.756527Z",
     "start_time": "2020-06-20T17:11:35.679204Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package wordnet to /home/stefan/nltk_data...\n",
      "[nltk_data]   Package wordnet is already up-to-date!\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import nltk\n",
    "nltk.download('wordnet')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:11:36.655613Z",
     "start_time": "2020-06-20T17:11:35.757665Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('prices', 'price'),\n",
       " ('has', 'ha'),\n",
       " ('has', 'ha'),\n",
       " ('increases', 'increase'),\n",
       " ('economists', 'economist'),\n",
       " ('prices', 'price'),\n",
       " ('figures', 'figure'),\n",
       " ('figures', 'figure'),\n",
       " ('months', 'month'),\n",
       " ('approvals', 'approval'),\n",
       " ('figures', 'figure'),\n",
       " ('mortgages', 'mortgage'),\n",
       " ('has', 'ha'),\n",
       " ('prices', 'price'),\n",
       " ('prices', 'price'),\n",
       " ('prices', 'price'),\n",
       " ('as', 'a'),\n",
       " ('gains', 'gain'),\n",
       " ('years', 'year'),\n",
       " ('prices', 'price'),\n",
       " ('houses', 'house'),\n",
       " ('flats', 'flat'),\n",
       " ('prices', 'price'),\n",
       " ('was', 'wa'),\n",
       " ('areas', 'area'),\n",
       " ('was', 'wa')]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[(word, word.lemmatize()) for i, word in enumerate(parsed_body.words) \n",
    " if word != parsed_body.words[i].lemmatize()]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lemmatization relies on parts-of-speech (POS) tagging; `spaCy` performs POS tagging, here we make assumptions, e.g. that each token is verb."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:11:36.662746Z",
     "start_time": "2020-06-20T17:11:36.656596Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('prices', 'price'),\n",
       " ('dipped', 'dip'),\n",
       " ('has', 'have'),\n",
       " ('said', 'say'),\n",
       " ('has', 'have'),\n",
       " ('suggested', 'suggest'),\n",
       " ('housing', 'house'),\n",
       " ('is', 'be'),\n",
       " ('slowing', 'slow'),\n",
       " ('increases', 'increase'),\n",
       " ('prices', 'price'),\n",
       " ('figures', 'figure'),\n",
       " ('cooling', 'cool'),\n",
       " ('is', 'be'),\n",
       " ('figures', 'figure'),\n",
       " ('are', 'be'),\n",
       " ('weakening', 'weaken'),\n",
       " ('coming', 'come'),\n",
       " ('housing', 'house'),\n",
       " ('is', 'be'),\n",
       " ('slowing', 'slow'),\n",
       " ('is', 'be'),\n",
       " ('backed', 'back'),\n",
       " ('lending', 'lend'),\n",
       " ('said', 'say'),\n",
       " ('is', 'be'),\n",
       " ('lagging', 'lag'),\n",
       " ('figures', 'figure'),\n",
       " ('said', 'say'),\n",
       " ('mortgages', 'mortgage'),\n",
       " ('approved', 'approve'),\n",
       " ('has', 'have'),\n",
       " ('fallen', 'fall'),\n",
       " ('said', 'say'),\n",
       " ('prices', 'price'),\n",
       " ('increased', 'increase'),\n",
       " ('said', 'say'),\n",
       " ('prices', 'price'),\n",
       " ('rose', 'rise'),\n",
       " ('is', 'be'),\n",
       " ('predicting', 'predict'),\n",
       " ('prices', 'price'),\n",
       " ('stabilises', 'stabilise'),\n",
       " ('gains', 'gain'),\n",
       " ('attributed', 'attribute'),\n",
       " ('prices', 'price'),\n",
       " ('detached', 'detach'),\n",
       " ('houses', 'house'),\n",
       " ('said', 'say'),\n",
       " ('rose', 'rise'),\n",
       " ('prices', 'price'),\n",
       " ('had', 'have'),\n",
       " ('fallen', 'fall'),\n",
       " ('showed', 'show'),\n",
       " ('was', 'be'),\n",
       " ('was', 'be'),\n",
       " ('said', 'say'),\n",
       " ('showed', 'show'),\n",
       " ('followed', 'follow'),\n",
       " ('had', 'have'),\n",
       " ('rose', 'rise')]"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[(word, word.lemmatize(pos='v')) for i, word in enumerate(parsed_body.words) \n",
    " if word != parsed_body.words[i].lemmatize(pos='v')]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Sentiment & Polarity"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "TextBlob provides polarity and subjectivity estimates for parsed documents using dictionaries provided by the Pattern library. These dictionaries lexicon map adjectives frequently found in product reviews to sentiment polarity scores, ranging from -1 to +1 (negative ↔ positive) and a similar subjectivity score (objective ↔ subjective).\n",
    "\n",
    "The .sentiment attribute provides the average for each over the relevant tokens, whereas the .sentiment_assessments attribute lists the underlying values for each token"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:11:36.702603Z",
     "start_time": "2020-06-20T17:11:36.663931Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Sentiment(polarity=0.10447845804988663, subjectivity=0.44258786848072557)"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "parsed_body.sentiment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:11:36.708850Z",
     "start_time": "2020-06-20T17:11:36.703482Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Sentiment(polarity=0.10447845804988663, subjectivity=0.44258786848072557, assessments=[(['slightly'], -0.16666666666666666, 0.16666666666666666, None), (['average'], -0.15, 0.39999999999999997, None), (['recent'], 0.0, 0.25, None), (['strong'], 0.4333333333333333, 0.7333333333333333, None), (['likely'], 0.0, 1.0, None), (['overall'], 0.0, 0.0, None), (['down'], -0.15555555555555559, 0.2888888888888889, None), (['fairly'], 0.7, 0.9, None), (['nearly'], 0.1, 0.4, None), (['last'], 0.0, 0.06666666666666667, None), (['first'], 0.25, 0.3333333333333333, None), (['rose'], 0.6, 0.95, None), (['whole'], 0.2, 0.4, None), (['only'], 0.0, 1.0, None), (['second'], 0.0, 0.0, None), (['half'], -0.16666666666666666, 0.16666666666666666, None), (['overall'], 0.0, 0.0, None), (['large'], 0.21428571428571427, 0.42857142857142855, None), (['recent'], 0.0, 0.25, None), (['rose'], 0.6, 0.95, None), (['same'], 0.0, 0.125, None), (['average'], -0.15, 0.39999999999999997, None), (['more'], 0.5, 0.5, None), (['average'], -0.15, 0.39999999999999997, None), (['rose'], 0.6, 0.95, None), (['only'], 0.0, 1.0, None), (['slightly'], -0.16666666666666666, 0.16666666666666666, None), (['previous'], -0.16666666666666666, 0.16666666666666666, None)])"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "parsed_body.sentiment_assessments"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Combine Textblob Lemmatization with `CountVectorizer`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:11:36.717443Z",
     "start_time": "2020-06-20T17:11:36.709833Z"
    }
   },
   "outputs": [],
   "source": [
    "def lemmatizer(text):\n",
    "    words = TextBlob(text.lower()).words\n",
    "    return [word.lemmatize() for word in words]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-20T17:11:36.725875Z",
     "start_time": "2020-06-20T17:11:36.718506Z"
    }
   },
   "outputs": [],
   "source": [
    "vectorizer = CountVectorizer(analyzer=lemmatizer, decode_error='replace')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": true,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}